class: center, middle, inverse, title-slide

# PLSC30500, Fall 2021
## 1.2 Data visualization

---
class: inverse, middle, center

# Toward a grammar of graphics

---
class: bg-full
background-image: url("data:image/png;base64,#assets/rosling_youtube.png")
background-position: center
background-size: contain

???

Source: https://www.youtube.com/watch?v=jbkSRLYSojo

---
class: bg-full
background-image: url("data:image/png;base64,#assets/rosling_youtube_zoom.png")
background-position: center
background-size: contain

???

Source: https://www.youtube.com/watch?v=jbkSRLYSojo

---

# Mapping attributes to aesthetics

Q: What is the **unit of observation**?

A: A country in a year

Q: How are a country-year's **attributes** mapped to **aesthetic** components of the graphic?

<table>
<thead>
<tr> <th style="text-align:left;"> Attribute </th> <th style="text-align:left;"> Aesthetic </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> Income </td> <td style="text-align:left;"> Horizontal position (x) </td> </tr>
<tr> <td style="text-align:left;"> Life expectancy </td> <td style="text-align:left;"> Vertical position (y) </td> </tr>
<tr> <td style="text-align:left;"> Population </td> <td style="text-align:left;"> Size of point </td> </tr>
<tr> <td style="text-align:left;"> Continent </td> <td style="text-align:left;"> Color of point </td> </tr>
</tbody>
</table>

---
class: bg-full
background-image: url("data:image/png;base64,#assets/Minard.png")
background-position: center
background-size: contain

### Minard's graphic on Napoléon in Russia

???

One of the "best statistical drawings ever created" (Tufte, *VDQI*)

Source: [Wikipedia](https://en.wikipedia.org/wiki/File:Minard.png)

---

# Mapping attributes to aesthetics

Q: What is the **unit of observation**?

A: An army (army division) on a day ("army-day")

Q: How are an army-day's **attributes** mapped to **aesthetic** components of the graphic?
<table>
<thead>
<tr> <th style="text-align:left;"> Attribute </th> <th style="text-align:left;"> Aesthetic </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> Longitude </td> <td style="text-align:left;"> Horizontal position (x) </td> </tr>
<tr> <td style="text-align:left;"> Latitude </td> <td style="text-align:left;"> Vertical position (y) </td> </tr>
<tr> <td style="text-align:left;"> Number of surviving soldiers </td> <td style="text-align:left;"> Width of line </td> </tr>
<tr> <td style="text-align:left;"> Direction (advance, retreat) </td> <td style="text-align:left;"> Color of line </td> </tr>
</tbody>
</table>

(Also note secondary plot showing temperature during retreat.)

---

# Data: structure

Our data is typically **rectangular**, with rows and columns like a spreadsheet.

--

Usually,

- each row should be one observation (e.g. country-year, army-day)
- each column should contain one attribute (e.g. life expectancy, number of surviving troops)

--

For example:

<table class="table" style="font-size: 12px; margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:left;"> country </th> <th style="text-align:left;"> continent </th> <th style="text-align:right;"> lifeExp </th> <th style="text-align:right;"> pop </th> <th style="text-align:right;"> gdpPercap </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:left;"> Asia </td> <td style="text-align:right;"> 43.828 </td> <td style="text-align:right;"> 31889923 </td> <td style="text-align:right;"> 974.5803 </td> </tr>
<tr> <td style="text-align:left;"> Albania </td> <td style="text-align:left;"> Europe </td> <td style="text-align:right;"> 76.423 </td> <td style="text-align:right;"> 3600523 </td> <td style="text-align:right;"> 5937.0295 </td> </tr>
<tr> <td style="text-align:left;"> Algeria </td> <td style="text-align:left;"> Africa </td> <td style="text-align:right;"> 72.301 </td> <td style="text-align:right;"> 33333216 </td> <td
style="text-align:right;"> 6223.3675 </td> </tr>
<tr> <td style="text-align:left;"> Angola </td> <td style="text-align:left;"> Africa </td> <td style="text-align:right;"> 42.731 </td> <td style="text-align:right;"> 12420476 </td> <td style="text-align:right;"> 4797.2313 </td> </tr>
<tr> <td style="text-align:left;"> Argentina </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 75.320 </td> <td style="text-align:right;"> 40301927 </td> <td style="text-align:right;"> 12779.3796 </td> </tr>
<tr> <td style="text-align:left;"> Australia </td> <td style="text-align:left;"> Oceania </td> <td style="text-align:right;"> 81.235 </td> <td style="text-align:right;"> 20434176 </td> <td style="text-align:right;"> 34435.3674 </td> </tr>
<tr> <td style="text-align:left;"> Austria </td> <td style="text-align:left;"> Europe </td> <td style="text-align:right;"> 79.829 </td> <td style="text-align:right;"> 8199783 </td> <td style="text-align:right;"> 36126.4927 </td> </tr>
<tr> <td style="text-align:left;"> Bahrain </td> <td style="text-align:left;"> Asia </td> <td style="text-align:right;"> 75.635 </td> <td style="text-align:right;"> 708573 </td> <td style="text-align:right;"> 29796.0483 </td> </tr>
<tr> <td style="text-align:left;"> Bangladesh </td> <td style="text-align:left;"> Asia </td> <td style="text-align:right;"> 64.062 </td> <td style="text-align:right;"> 150448339 </td> <td style="text-align:right;"> 1391.2538 </td> </tr>
<tr> <td style="text-align:left;"> Belgium </td> <td style="text-align:left;"> Europe </td> <td style="text-align:right;"> 79.441 </td> <td style="text-align:right;"> 10392226 </td> <td style="text-align:right;"> 33692.6051 </td> </tr>
</tbody>
</table>

???

Data in this format is sometimes referred to as "tidy" [(Wickham 2014)](https://vita.had.co.nz/papers/tidy-data.pdf).
I think this concept is useful as long as you recognize that the definition of "unit of observation" (and thus attribute/variable) depends on the purpose for which the data is being used.

---
class: inverse, middle, center

# Making (beautiful and informative) graphics

---

# Making graphics in `R`

We will use the `ggplot2` package, which is part of the `tidyverse` collection of packages.

Basic components of plotting with `ggplot`:

- data
- mapping of attributes (columns of data) to aesthetics
- geometric representations of data (`geom`s)

---
class: inverse, middle, center

# Quick detour: getting data into `R`

---

# Getting data into `R` (and `RStudio`)

An interactive option: `Import Dataset` button in `Environment` pane of `RStudio`

--

But note it's showing you the code it's using!

(Live coding example.)

---

# Getting data into `R` (cont'd)

Most commonly used functions:

- `read_csv()` and `read_rds()` in `readr` (`tidyverse`)
- `readstata13::read.dta13()` for Stata files (`.dta`)
- `readxl::read_excel()` for Excel files (`.xls`, `.xlsx`)
- `load()` in base R for saved R objects (`.RData`)

All require the path to the file as an argument.

--

Sometimes data comes in a package, e.g. `babynames`, `gapminder`, `vdemdata`

--

See Chapter 11 of R4DS and "Data Import" cheatsheet.
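
---

# Getting data into `R`: a small sketch

A rough illustration of the "path to file" pattern from the previous slide. To keep it self-contained, it first writes a tiny CSV to a temporary file and then reads it back with `read_csv()`. The file, its two columns, and the object name `small` are made up for illustration; in practice `path` would point to your own data file.

```r
library(readr)

# Hypothetical data: write a tiny CSV to a temporary path
# so the example runs without any external file
path <- tempfile(fileext = ".csv")
write_csv(data.frame(country = c("Albania", "Algeria"),
                     lifeExp = c(76.423, 72.301)),
          path)

# read_csv() takes the path to the file as its first argument
small <- read_csv(path)
small
```

The other import functions listed above follow the same shape: the path comes first, and the result is assigned to a name.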
---

## A simple example

Get data in:

```r
library(tidyverse)  # provides ggplot(), filter(), and %>%

gapminder_2007 <- gapminder::gapminder %>%
  filter(year == 2007 & continent != "Oceania") # will explain next week
```

Make a plot:

```r
ggplot(data = gapminder_2007,
       mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
```

---

```r
ggplot(data = gapminder_2007,
       mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
```

<img src="data:image/png;base64,#slides_12_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

To note:

- the **arguments** to `ggplot()` say what the data is (`data = gapminder_2007`) and how attributes are mapped to aesthetics (`mapping = aes(x = gdpPercap, y = lifeExp)`)
- `geom_point()` says "plot a point for each observation"
- **layers** of plot linked with plus sign (`+`)

---

<!-- Let's map population to the size of the points: -->

```r
ggplot(data = gapminder_2007,
       mapping = aes(x = gdpPercap, y = lifeExp,
*                    size = pop)) +
  geom_point()
```

<img src="data:image/png;base64,#slides_12_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---

<!-- Let's map continent to the color of the points: -->

```r
ggplot(data = gapminder_2007,
       mapping = aes(x = gdpPercap, y = lifeExp, size = pop,
*                    col = continent)) +
  geom_point()
```

<img src="data:image/png;base64,#slides_12_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

---

<!-- Let's put the x-axis on the log scale: -->

```r
ggplot(data = gapminder_2007,
       mapping = aes(x = gdpPercap, y = lifeExp, size = pop, col = continent)) +
  geom_point() +
*  scale_x_log10()
```

<img src="data:image/png;base64,#slides_12_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

---

# Minard data

<table class="table" style="font-size: 15px; margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:right;"> long </th> <th style="text-align:right;"> lat </th> <th style="text-align:right;"> survivors </th> <th style="text-align:left;">
direction </th> <th style="text-align:right;"> group </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 24.0 </td> <td style="text-align:right;"> 54.9 </td> <td style="text-align:right;"> 340000 </td> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 24.5 </td> <td style="text-align:right;"> 55.0 </td> <td style="text-align:right;"> 340000 </td> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 25.5 </td> <td style="text-align:right;"> 54.5 </td> <td style="text-align:right;"> 340000 </td> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 26.0 </td> <td style="text-align:right;"> 54.7 </td> <td style="text-align:right;"> 320000 </td> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 27.0 </td> <td style="text-align:right;"> 54.8 </td> <td style="text-align:right;"> 300000 </td> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 28.0 </td> <td style="text-align:right;"> 54.9 </td> <td style="text-align:right;"> 280000 </td> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 28.5 </td> <td style="text-align:right;"> 55.0 </td> <td style="text-align:right;"> 240000 </td> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 29.0 </td> <td style="text-align:right;"> 55.1 </td> <td style="text-align:right;"> 210000 </td> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 30.0 </td> <td style="text-align:right;"> 55.2 </td> <td style="text-align:right;"> 180000 </td> <td style="text-align:left;"> A </td> <td 
style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 30.3 </td> <td style="text-align:right;"> 55.3 </td> <td style="text-align:right;"> 175000 </td> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 32.0 </td> <td style="text-align:right;"> 54.8 </td> <td style="text-align:right;"> 145000 </td> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 33.2 </td> <td style="text-align:right;"> 54.9 </td> <td style="text-align:right;"> 140000 </td> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 34.4 </td> <td style="text-align:right;"> 55.5 </td> <td style="text-align:right;"> 127100 </td> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 35.5 </td> <td style="text-align:right;"> 55.4 </td> <td style="text-align:right;"> 100000 </td> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 36.0 </td> <td style="text-align:right;"> 55.5 </td> <td style="text-align:right;"> 100000 </td> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 37.6 </td> <td style="text-align:right;"> 55.8 </td> <td style="text-align:right;"> 100000 </td> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 37.7 </td> <td style="text-align:right;"> 55.7 </td> <td style="text-align:right;"> 100000 </td> <td style="text-align:left;"> R </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 37.5 </td> <td style="text-align:right;"> 55.7 </td> <td style="text-align:right;"> 98000 </td> <td style="text-align:left;"> R </td> <td style="text-align:right;"> 1 </td> </tr> </tbody> 
</table>

---

```r
ggplot(data = minard,
       mapping = aes(x = long, y = lat, size = survivors, col = direction, group = group)) +
*  geom_path()
```

<!-- -->

???

"`geom_path()` connects the observations in the order in which they appear in the data. `geom_line()` connects them in order of the variable on the x axis. `geom_step()` creates a stairstep plot, highlighting exactly when changes occur."

Source: https://ggplot2.tidyverse.org/reference/geom_path.html

---

# Back to `gapminder`

```r
ggplot(data = gapminder_2007,
       mapping = aes(x = gdpPercap, y = lifeExp, size = pop, col = continent)) +
  geom_point() +
  scale_x_log10()
```

<img src="data:image/png;base64,#slides_12_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />

---

# Adding a smoothing line

`geom_smooth()` adds a "smoother". Let's try adding it!

--

```r
ggplot(data = gapminder_2007,
       mapping = aes(x = gdpPercap, y = lifeExp, size = pop, col = continent)) +
  geom_point() +
  scale_x_log10() +
*  geom_smooth()
```

<img src="data:image/png;base64,#slides_12_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />

Hmm.

???

If you don't exclude Oceania, `ggplot` refuses to make a smoother using default settings because there are too few countries in Oceania.

---

# Inheritance of aesthetics

```r
ggplot(data = gapminder_2007,
       mapping = aes(x = gdpPercap, y = lifeExp, size = pop)) +
*  geom_point(aes(col = continent)) +
  scale_x_log10()
```

<img src="data:image/png;base64,#slides_12_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />

???

Aesthetics set in `ggplot()` are inherited by every layer; aesthetics set inside a `geom` apply only to that layer.

---

## Data summary w. `geom_smooth()`

```r
ggplot(data = gapminder_2007,
       mapping = aes(x = gdpPercap, y = lifeExp, size = pop)) +
  geom_point(aes(col = continent)) +
  scale_x_log10() +
*  geom_smooth()
```

<img src="data:image/png;base64,#slides_12_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />

---

## Linear version w. `geom_smooth()`

```r
ggplot(data = gapminder_2007,
       mapping = aes(x = gdpPercap, y = lifeExp, size = pop)) +
  geom_point(aes(col = continent)) +
  scale_x_log10() +
*  geom_smooth(method = lm) # lm means "linear model"
```

<img src="data:image/png;base64,#slides_12_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />

???

Anything after # (on the same line) is a "comment" and is ignored by R. This is useful for explaining to humans what is going on in the code.

---

## Small multiples: `facet_wrap()`

```r
ggplot(data = gapminder_2007,
       mapping = aes(x = gdpPercap, y = lifeExp, size = pop)) +
  geom_point() +
  scale_x_log10() +
  geom_smooth(method = lm) +
*  facet_wrap(vars(continent))
```

<img src="data:image/png;base64,#slides_12_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />

---

# Other geoms

```r
ggplot(data = gapminder_2007,
       mapping = aes(x = gdpPercap, y = lifeExp)) +
*  geom_density2d() +
  scale_x_log10()
```

<img src="data:image/png;base64,#slides_12_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />

---

# Other geoms

```r
ggplot(data = gapminder_2007,
       mapping = aes(x = gdpPercap, y = lifeExp)) +
*  geom_label(aes(label = country)) +
  scale_x_log10()
```

<img src="data:image/png;base64,#slides_12_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />

---

# How to learn more

Practice and experiment. (And do problem sets.)

Resources:

- *R for Data Science*
- RStudio primers
- RStudio "Data Visualization" cheat sheet
- Google
- Stack Overflow

---
class: bg-full
background-image: url("data:image/png;base64,#assets/data_viz_cheatsheet.png")
background-position: center
background-size: contain

???
Source: [RStudio Cheatsheets](https://www.rstudio.com/resources/cheatsheets/)

---

# Back to the big picture

Components of a `ggplot`:

- data, with observations in rows, attributes in columns
- mapping of attributes to aesthetics (x, y, size, shape, color, transparency, etc.)
- geometric objects (`geom`s)

Next: getting data into the right format for plotting (and analysis).

---

<!-- # Digression about coding vs clicking -->

<!-- --- -->

# Assignment

Before next lecture, RStudio Cloud primers:

- "Work with data" (https://rstudio.cloud/learn/primers/2): `tibble`, `select()`, `filter()`, `arrange()`, `%>%`, `summarize()`, `group_by()`, `mutate()`
- "Data visualization" (https://rstudio.cloud/learn/primers/3): more practice with `ggplot`

Alternatively, Chapters 5 and 6 of R4DS.

<!-- ## Plan -->

<!-- First, examples where data columns already map to desired aesthetics 1 to 1. -->

<!-- - use (`gapminder`?) example to connect data to aesthetics (a la Heiss) -->
<!-- - introduce types of plots (from the cheatsheet?) -->
<!-- - `facet_wrap()`, `facet_grid()` -->
<!-- - practice and use examples -->

<!-- Then, cases where this is not true, and we need to do some data wrangling. -->

<!-- - making a new variable (`mutate()`) -->
<!-- - subsetting (`filter()`) -->
<!-- - summarizing? `group_by()`, `summarize()` -->
<!-- - `pivot_longer()`, `pivot_wider()` -->

<!-- but always ending with a figure I guess. -->

<!-- And that's basically it. -->